%% ****** Start of file aiptemplate.tex ****** %
%%
%%   This file is part of the files in the distribution of AIP substyles for REVTeX4.
%%   Version 4.1 of 9 October 2009.
%%
%
% This is a template for producing documents for use with
% the REVTEX 4.1 document class and the AIP substyles.
%
% Copy this file to another name and then work on that file.
% That way, you always have this original template file to use.

\documentclass[aip,graphicx]{revtex4-1}
%\documentclass[aip,reprint]{revtex4-1}
\usepackage{graphics}
\usepackage[draft]{hyperref}
\usepackage{amsmath}
\usepackage{multirow}

\draft % marks overfull lines with a black rule on the right

\begin{document}

% Use the \preprint command to place your local institutional report number
% on the title page in preprint mode.
% Multiple \preprint commands are allowed.
\preprint{}

\title{Mesh-based Monte Carlo Method in Time-domain Widefield Fluorescence Molecular Tomography}

\author{Jin Chen}
\affiliation{Department of Biomedical Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA}

\author{Qianqian Fang}
\affiliation{Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA 02129, USA}

\author{Xavier Intes}
\email[]{intesx@rpi.edu}
\affiliation{Department of Biomedical Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA}

% repeat the \author .. \affiliation  etc. as needed
% \email, \thanks, \homepage, \altaffiliation all apply to the current author.
% Explanatory text should go in the []'s,
% actual e-mail address or url should go in the {}'s for \email and \homepage.
% Please use the appropriate macro for the type of information

% \affiliation command applies to all authors since the last \affiliation command.
% The \affiliation command should follow the other information.

%\email[]{Your e-mail address}
%\homepage[]{Your web page}
%\thanks{}
%\altaffiliation{}
% Collaboration name, if desired (requires use of superscriptaddress option in \documentclass).
% \noaffiliation is required (may also be used with the \author command).
%\collaboration{}
%\noaffiliation

\date{\today}

\begin{abstract}We evaluated the potential of mesh-based Monte Carlo method for widefield time-gated fluorescence molecular tomography, aiming to improve accuracy in both shape discretization and photon transport modeling in preclinical settings. An optimized software platform was developed utilizing multi-threading and distributed parallel computing to achieve efficient calculation. We validated the proposed algorithm and software by both simulations and {\em in vivo} studies. The results establish that, the optimized mesh-based Monte Carlo method is a computationally efficient solution for optical tomography studies in terms of both calculation time and memory utilization. The open source code, as part of a new release of ``MMC'', is publicly available at http://mcx.sourceforge.net/mmc/.
\end{abstract}

\pacs{}% insert suggested PACS numbers in braces on next line

\maketitle %\maketitle must follow title, authors, abstract and \pacs

% Body of paper goes here. Use proper sectioning commands.
% References should be done using the \cite, \ref, and \label commands
\section{Introduction}
\noindent
Optical imaging has been the subject of intense investigation over the past decades, largely due to the fact that light at visible and near-infrared wavelengths is a non-ionizing, non-invasive probe with numerous applications in medicine \cite{Ntziachristos2005,Gibson2005}. In fluorescence molecular tomography (FMT), an emerging optical imaging technique for preclinical applications, one can solve for 3D \textit{in vivo} fluorescent marker distribution maps using only non-contact surface optical measurements and tomographic image reconstructions \cite{Mansfield2010, Hielscher2005}. This makes FMT a suitable method for small animal research where fluorophores are designed to label the drugs of interest, and enables the tracking of their delivery process. A successful tomographic reconstruction in FMT typically requires the forward modeling of photon transport inside optically complex tissue. The diffusion equation, an approximation to the more general Radiative Transport Equation (RTE), is usually employed as the forward model for simplicity. However, this approximation becomes invalid when modeling low-scattering and highly absorbing tissue, resulting in potentially inaccurate quantification in small animal imaging where specimens exhibit wide variation in optical properties with possible presence of void regions (lung, bladder for instances) \cite{yoo1990}. Moreover, the diffusion equation cannot accurately represent the propagation of short light pulses in tissue for the portion of photons arriving at the surface early~\cite{xu2002,sakami2002}, which has been shown an effective datatype to improve the resolution in reconstruction when no {\em a priori} information is available \cite{Niedre2008,Chen2010,Wu1997}.

The Monte Carlo (MC) method is an accurate light propagation modeling approach in dealing with general media such as those in small animals and early photons \cite{Wang1995, Ma2007}. This method approximates the RTE solution via random sampling of large numbers of photons, thus it is often used as the gold standard for modeling time-resolved light propagation for all optical properties encountered in these scenarios. Nevertheless, MC-based techniques suffer from low computational efficiency (hours/days of computation) \cite{Boas2002} because numerous photon simulation trials are required to obtain satisfactory statistics. When time-domain data are considered, the photons are split into small time-bins, which leads to an increase in the number of photons to be simulated to achieve reasonable statistics per time window, and make calculations even more time-consuming. Only recently our group developed a computationally efficient MC-based reconstruction technique for FMT utilizing time-gated data sets \cite{Chen2011}. This voxel-based Monte Carlo (vMC) method allows for computation of functional and fluorescent Jacobians for whole-body tomography within a couple of hours. However, difficulty arises when applying the vMC algorithm, as the commonly employed cubic shape voxels in regular 3D grid cannot accurately simulate the curved boundaries. One remedy is to raise the voxel density, but doing so drastically increases the memory and computational burden. This burden is further amplified when using widefield illumination strategies \cite{Chen2010}, where photons interact with a much larger boundary area compared to the conventional punctual light source scheme. Therefore, an MC algorithm that has flexibility in representing the arbitrary domain shape is highly desired.

In this work, we present an optimized software platform to solve the inverse problem in FMT with spatially extended arbitrary sources. The novelty of this approach lies in the accurate boundary representation and rapid calculation when employing the mesh-based Monte Carlo (mMC) method \cite{fang2010}. Based on a currently available mMC code \cite{mmc}, an extended hybrid parallel version was implemented to model arbitrary illumination patterns. A comparison was performed between this method and our previously developed vMC method in terms of quantitative accuracy and computational efficiency for whole-body tomographic reconstruction in time-domain. This method was then validated and evaluated by small animal full-body simulations and experiments in tomographic settings.

\section{Method}
\subsection{\label{sec:meth_mmc}Widefield mesh-based Monte Carlo method}
The mMC method utilizes fast ray-tracing techniques to accelerate calculation and is capable of simulating point illumination on complex geometry \cite{fang2010}. Here the widefield illumination was developed based on its version 0.8 release as provided at \cite{mmc}. In order to simulate a realistic widefield source with spatial intensity variance, the illumination patterns were represented as 2D-grid (1mm$^2$ pixel) images with an intensity value assigned at each grid element (Fig.~\ref{fig:flow}A). The initial positions of the total photons were uniformly distributed over the illumination image by using the uniform random number generator. Any photon falling into one pixel had the pixel's intensity value assigned as its initial packet weight, allowing for arbitrary illumination settings (Fig.~\ref{fig:flow}B). A simulated photon was then projected along the z-axis and entered the mesh at an intersection point on the mesh surface. To identify the surface injection point, we applied a ray-tracing step to obtain its 3D coordinates by testing all surface triangles using a Havel-Herout ray-tracing algorithm~\cite{Havel2010}. In case multiple intersection points were identified (Fig.~\ref{fig:flow}b and \ref{fig:flow}c), the intersection with the shortest distance to the initial position (Fig.~\ref{fig:flow}b) was selected as the injection point on the surface. After a photon's injection point was determined, its propagation in the media was simulated following the same rules as in classical MC techniques, until it exited the surface or the total time of interest was reached.

\begin{figure}
\includegraphics{widefmmc-1}
\caption{\label{fig:flow}The illustration for generating the widefield illumination using a digital mouse phantom. A: arbitrary illumination pattern; B: photons projected to the surface; C: surface region of interest; a: initial position of a photon; b: the actual injection point of this photon; c: a potential injection point of this photon (discarded since the distance to the initial position is longer than b).}
\end{figure}

Unfortunately, complete surface triangle test of this additional ray-tracing step imposed a significant computational overhead, particularly when the surface mesh was dense. For example, the calculation time for the ray-triangle intersection test could be 50x more than that for the photon propagation on a mouse model. In order to accelerate the computation time for the widefield method to match the punctual illumination method, a few algorithmic optimizations were implemented, particularly considering the fact that the $x-y$ coordinate of the injection position is dependent on the photon's initial position, while the $z$ coordinate is dependent on the intersection of the photon and the surface. First, we confined the initial positions to the region of interest~(ROI) and the surface triangles in the ROI were picked out (Fig.~\ref{fig:flow}c, the dark gray surface). Second, using the projected triangles on the x-y plane, a subset of triangles was selected and stored in memory. For any launched photon, a culling technique \cite{wald2004} was then implemented to confine ray-triangle tests only to the triangles with their bounding boxes containing the photon's $x-y$ coordinate. These operations reduced redundant calculations thus allowed for acceleration in computational time.

\subsection{Inverse model for reconstruction}
The 3D distribution of a fluorophore's effective quantum yield $\eta (r)$ can be obtained by solving a linear system relating the fluorescence signals at time $t$ and $\eta(r)$:
\begin{equation}
\label{eq:mesh_fluo_linear}
U_F (r_s,r_d, t)=\int_{\Omega} W(r_s, r_d, r, t) \eta (r) dr,
\end{equation}
where $U_F(r_s,r_d,t)$ is the fluorescence detected by a detector located at $r_d$ at $t$ resulting from an excitation at the source $r_s$ at $t_0=0$, the integration domain $\Omega$ is defined as the entire imaging volume, and $W(r_s,r_d,r,t)$, referred to as the weight function or Jacobian, describes how sensitive a change in $\eta(r)$ will result in a change in $U_F(r_s,r_d,t)$. There are a few MC-based methods to calculate $W$, and based on the comparison between these different methods, we selected the forward-adjoint MC method to generate time-gated Jacobians. It allowed for computing Jacobian in an efficient manner \cite{Chen2011a}, thanks to the smaller dataset acquired with widefield illumination strategies \cite{Chen2010, Belanger2010}. In this method, $W$ is computed by convolving the Green's functions (the light propagation for impulse sources) and the lifetime decay \cite{Chen2011}:
\begin{equation}
\label{eq:fluo_conv}
W(r_s,r_d,r,t)= \int_0^t e^{- (t-t')/\tau} dt'\int_0^{t'} G^x(r_s, r, t'-t'')G^m(r,r_d, t'')dt'',
\end{equation}
where $G_x$ and $G_m$ are the time-dependent background Green's functions calculated by mMC simulations at excitation and emission wavelength, respectively \cite{Arridge1999}, and $\tau$ is the lifetime of the fluorophore. The system of linear equations representing the measurements detected at different positions and time can then be solved to obtain the fluorophore distribution.

\subsection{Computational settings and efficiency evaluation}
\label{sec:sim_setting}
\noindent
For forward simulations, we incorporated the widefield related computation in the mMC software platform. In addition to the previously implemented multi-threading and single-instruction multiple-data (SIMD) parallelism, we further enhanced the modeling efficiency by integrating Message Passing Interface (MPI) based parallel computing technique with the code. This allows one to run mMC in a distributed memory system, such as a high-performance cluster. To utilize the maximum computational power of platforms with both multi-threads and multi-nodes, an adaptive hybrid parallelization method using the communication protocol MPI and OpenMP APIs was implemented. The MC approach is highly parallelizable, that is, the large number of photons can be broken up into smaller sets and calculated independently. In terms of distribution, the MPI protocol had photons equally divided and distributed to all nodes, while the OpenMP protocols dynamically set the assignment of photons for the threads on a node. The seed for the random number generator was assigned based on the thread and node ID. A reduce operation was performed to sum up the result to a single node (master node) after all nodes finished computing.

In this particular work, the computational efficiency of the hybrid parallel code was tested using Single/Dual Core CPUs (BlueGene/Opteron on CCNI, RPI). To estimate the scalability of multi-threading and multi-CPU calculation, quantitative metrics were computed to measure the performance of the parallel programs. We define the speedup as the ratio of the execution time on a single processor (the sequential version) to that on a multi-node cluster. This ratio is defined as:
\begin{equation}
\label{eq:speedup}
S(n) = \frac{t_s}{t_p(n)},
\end{equation}
where $n$ is the number of threads/processes used in the parallel simulation, $t_s$ is the execution time on a single-thread processor and $t_p$ is the execution time on a parallel computer. $S(n)$ therefore describes the scalability of the system as the number of threads/processes increases. The efficiency was computed as:
\begin{equation}
\label{eq:effi}
E(n)=\frac{t_s}{n\times t_p(n)},
\end{equation}
which describes the fraction of the time that is being used by the processors for the computation.

\subsection{Accuracy comparison}
\label{sec:meth_accuracy}
\noindent
Due to a wider illuminated surface, the simulation result of widefield illumination can be more sensitive to errors of boundary modeling than conventional point illumination used in optical tomography. To quantitatively assess the performance of mMC and vMC dealing with different types of illumination on complex models, a simulation using high resolution voxel-based model was first set up as the reference. The light propagation profiles using both a mesh-based model and a re-sampled low resolution voxel-based model with the same number of tetrahedral elements/valid voxels~(voxels that are not air) were then computed.

To mimic a pre-clinical tomographic reconstruction, the high-resolution Digimouse model (0.1mm$^3$) generated from CT data~\cite{Dogdas2007} was employed to create the mesh model and low resolution voxelized model~(Fig.~\ref{fig:accuracy_set}). The software package Iso2mesh \cite{Fang2009a} was used to convert the volumetric model to a mesh model containing 4075 node, 22379 tetrahedron elements. The corresponding low-resolution voxel-based model had a comparable number of 22433 valid voxels resulting in 1mm$^3$ voxels, which is the typical voxel size in preclinical applications. Notice here the mesh model was homogeneously tesselated with no denser sampling around the boundary or internal organs to assure an impartial comparison with the homogeneously voxelized model. The illumination pattern covered a 5-7cm area with full coverage of the transverse plane. A cross-sectional plot is shown in Fig. ~\ref{fig:accuracy_set}b for mesh- and voxel-based model, respectively. In all simulations above, the numbers of photons were kept at $10^8$ according to ref.\cite{Chen2011a}. The optical properties were set to $\mu_a = 0.3$cm$^{-1}$, $\mu_s = 15$cm$^{-1}, g = 0.9, n= 1.37$, which are typical values for mouse tissue in the NIR spectral region in the literature \cite{Alexandrakis2005}. The time profiles were recorded with a 300ps gate width and 20ps time shift between the gates to replicate the RPI imaging platform for optimal experimental performance~\cite{Venugopal2010}.
\begin{figure}[ht]
\centering
\includegraphics{widefmmc-2}
\caption{\label{fig:accuracy_set} (a) The high resolution mouse atlas employed in the accuracy comparison. (b) The selected slice in (a) for boundary visualization overlapped with the mesh model and the low resolution voxel model. The red line shows the boundary of the high resolution CT image, while the black and blue lines represent that of the mesh- and low resolution voxel model, respectively.}
\end{figure}

The time-gated profiles of light propagation using the mesh- and high-resolution models were scaled down to the size of the low resolution profile, then the percentage error of mesh- and low-resolution models in relation to the high-resolution model was calculated to access the accuracy quantitatively. Further evaluation of the methods were conducted by calculating the overall root mean square difference (RMS difference) between the results using the two methods and high resolution standard:
\begin{equation}
\label{eq:RMS}
RMS(t) = \sqrt{\frac{\sum_{i=1}^N(p(t)-p_0(t))^2}{\sum_{i=1}^N{p_0(t)^2}}},
\end{equation}
where $N$ is the number of voxels in the low-resolution voxel model, $t$ is the time, $p_0$ is the high resolution result and $p$ is that calculated by low resolution mMC or vMC.

\subsection{Tomographic reconstruction settings}
\subsubsection{Simulation reconstructions}
The above mentioned models were utilized for a full tomographic reconstruction study. The high-resolution mouse model was employed to generate fluorescence measurements with the pre-segmented kidneys labeled with simulated fluorescent markers (effective quantum yield 0.1; $\tau$ = 0.8ns). 64 sliding half-space illumination patterns and 64 point detectors were simulated within the abdomen section (x: 50-70mm, y: 8.5-28.5mm, 400 pixels for each pattern) using 512 single-thread nodes on BlueGene. The detectors were evenly spanned over the region of interest with a 2.857mm separation (Fig.~\ref{fig:FMT_set}a). A 25\% rising gate (early gate) was selected and the Born normalization \cite{Ntziachristos2001} of the corresponding time point spread functions (TPSFs) was employed as the datatype for reconstruction herein. The conjugate gradient method was used to solve the inverse problem and the iteration was stopped when the relative error was less than 0.02.

\begin{figure}[ht]
 \centering
 \includegraphics{widefmmc-3}
 \caption{\label{fig:FMT_set}
 (a) Mesh-based mouse model with kidneys depicted in red. The patch of black and red shows one sample scanning pattern out of a total of 64. The black areas indicate where the photon source is off, while the red area shows where it is on. The black circles at the bottom represent the detectors. (b) The segmented boundary of CT-scanned images from the {\em in vivo} experiment overlayed with the half-space illumination patterns and normalized Born measurements (Video\_1, MPEG, 1.61 MB). The red box and the black circles show the entire illuminated area and the detector positions, respectively.}
 \end{figure}

\subsubsection{Experimental validation}
\noindent The performance of this method was also evaluated with experimental data for FMT. The data were acquired from an \emph{in vivo} experiment using time-resolved widefield tomography platform (the details of the system can be found in~\cite{Venugopal2010}). The system employed a tunable femtosecond laser as the source and a time-gated ICCD camera as the detector. A pico-projector module was used for source pattern generation allowing for rapid acquisition of spatially and temporally dense measurements. The experiment was performed on a freshly euthanized mouse with a 13mm long tube with 1mm diameter placed in the thoracic cavity~(Fig. \ref{fig:exp_result}a) \cite{Venugopal2010a}. The tube contained 14$pmol$ of IRDye 800CW R(LiCOR) in 1$\mu L$ ethanol. 64 sliding bar-shaped illumination patterns were projected over a 33mm$\times 18$mm area (594 pixels to represent each pattern) on the abdomen with 97 point detectors (0.5mm$^2$ areas) measuring the transmitted excitation and emission photons. The detectors were empirically selected from the high-resolution CCD camera images over the area where fluorescence signals were detected, with denser detectors at the highly fluorescing region. We acquired a profile covering 4.6ns with 40ps offset between two successive gates. The animal was then subsequently scanned by Micro CT (Scanoco Viva CT40) and the CT images were registered with the optical platform to locate the fluorescent tube.

The surface geometry of the region of interest ($40mm\times 31mm$) was extracted to generate a homogeneously tesselated model with 92,713 elements and 15,581 nodes for the mesh-based weight matrix calculation. The entire model was assigned the average background properties of the small animal acquired by spectroscopic fitting of the excitation measurements ($\mu_a = 0.3$cm$^{-1}$ and $\mu_s' = 25$cm$^{-1}$). The results of the forward simulations were stored in both node-based format (photon package weights are accumulated at each node) and element-based format (the result is saved for each element) to calculate the weight matrix. The number of nodes was only 16.8\% of that of elements thus the node-based format had advantages in both storing space and speed for data processing. The reconstructions employing weight matrices using the two formats were compared, to assess the possible loss of accuracy of using node-based format. The early-gate at 25\% of the maximum value was selected in reconstruction similarly to the synthetic study. For comparison, a reconstruction employing the same measurement dataset on a voxel-based model with 92,831 non-air $0.5mm^3$ voxels was performed.

\section{Results}
\subsection{Computational efficiency}
\noindent
Our implementation allows the generation of spatially complex illumination patterns over arbitrary oriented surfaces to accurately model non-contact widefield illumination strategies. After the optimizations mentioned in Sec.~\ref{sec:meth_mmc}, the overall computational overhead of simulating extended time-resolved sources in this complex scenario was only marginally larger (~5-10\%) over a point source configuration, while providing much greater functionality.

Fig.~\ref{fig:speedup}a shows the speedup of mMC calculation while using different threading settings with $10^7$ and $10^8$ photons. For simulations using a larger number of photons, the speedup curve has a marked improvement toward the ideal curve, implying higher efficiency ($E(16)=93.2\%$ for $10^7$ photons and $E(16)=99.3\%$ for $10^8$ photons with single-thread computation setting). The difference between the three threading methods is minimal ($<3\%$) when the same number of photons is employed, with pure threading being the closest to ideal. However, the current generation of multi-core hardware has a limitation on total number of threads thus limits the speedup potential of pure threading technique.

Fig.~\ref{fig:speedup}b depicts the time cost for mMC and vMC using different number of processes under single-thread parallel settings while having similar numbers of elements/valid voxels. Under all available processes settings (128-4,096), mMC remains roughly 4-5 times faster than vMC, while for large number of processes (4,096) the program approaches its scaling bottleneck.
\begin{figure}[ht]
 \centering
 \includegraphics{widefmmc-4}
 \caption{\label{fig:speedup}(a) The speedup using widefield mMC with $10^7$ and $18^8$ photons on single-thread nodes, 4-thread nodes and an 8-thread node; (b) the time cost for mMC and vMC on single-tread nodes.}
 \end{figure}

\subsection{Accuracy of time-domain Jacobians}
Fig.~\ref{fig:accu_forward}a and \ref{fig:accu_forward}b shows the forward photon propagation profiles at y = 60mm for the full field and point illumination, respectively. As expected, mMC fits the boundaries much better compared to vMC for widefield case, while both mMC and vMC matches well to the standard for point source case.  Fig.~\ref{fig:accu_forward}c presents the TPSFs for these two illumination settings at the same detector indicated in Fig.~\ref{fig:accu_forward}a and \ref{fig:accu_forward}b. Both the results using vMC and mMC methods fit the standard well for the point illumination, while the TPSFs for widefield illumination result in increased error for vMC compared to mMC, especially around the rising part (early gates) of the TPSF. A comparison of the different datatype extracted from the time-resolved measurements at these two detectors are listed in table~\ref{tab:tpsf}, with the Continuous Wave (CW) measurement being a percentage error and the other parameters as the absolute errors in time. The error in the $1^{st}$ moment of TPSF (CW measurements) reflects the integrated intensity difference and that in the $2^{nd}$ and $3^{rd}$ moments (mean flight time and variance) represent a change in position and shape of the TPSF. Overall, the mMC results are more accurate than the vMC ones, while the inaccuracy at the side detector (detector 1) is greater than that at the center detector (detector 2), due to the fact that the side detector is closer to the edge thus more affected by the mismatched boundary area. Comparing the widefield and point illumination, the error is greater in the widefield case as expected, as a result of an illumination pattern probing larger boundary area where mismatch boundaries exist.

\begin{table}[!h]
%        \tabcolsep 0pt
        \caption{Error table comparing the low resolution vMC and mMC with the standard. The CW measurement error is in percentage, the others are in absolute time.}
    \label{tab:tpsf}
    \vspace*{-12pt}
    \centering
    \def\temptablewidth{1.0\textwidth}
    {\rule{\temptablewidth}{1pt}}
    \begin{tabular}{rccccccccc}
    &&\multicolumn{2}{c}{CW(\%)}&\multicolumn{2}{c}{Tmax (ps)}&\multicolumn{2}{c}{Mean flight time (ps)}&\multicolumn{2}{c}{Variance (ps)}\\
    & &Det 1 & Det 2 &Det 1& Det 2 &Det 1& Det 2 &Det 1& Det 2\\
    \hline
    \multirow{2}{*}{Widefield} &mMC&-1.29 &-0.81&40&20&20.81&15.94&-3.93&-2.19\\
    & vMC& -3.35&1.99&80&0&67.83&18.81& -8.51&-3.00\\
    \hline
    \multirow{2}{*}{Point} &mMC& -1.62&-0.38&0&20& -6.96&20.93&1.55&-3.43\\
    & vMC& 2.48&0.73&0&20&21.12 &18.81 &-3.30&-3.00\\
        \end{tabular}
    {\rule{\temptablewidth}{1pt}}
\end{table}

\begin{figure}[ht]
 \centering
 \includegraphics{widefmmc-5}
 \caption{\label{fig:accu_forward}Forward simulation comparison for (a) widefield and (b) point source illumination. mMC:blue, vMC:grey; high resolution standard (std): solid color filled contours~(background).(c) The normalized time-resolved detector readings (black crosses and circles in a and b). The solid and dashed lines represent the TPSFs in the point source and widefield simulations, respectively. In the bottom figure the dashed and solid lines are overlapped due to similar detector responses for simulations with the point source and the widefield source.}
\end{figure}

The contours of continuous wave Jacobians using both the mMC and vMC methods are shown in Fig.~\ref{fig:accu_result}a and \ref{fig:accu_result}b for widefield and point illumination, respectively. In Fig.~\ref{fig:accu_result}c and \ref{fig:accu_result}d, the Jacobian contour comparison at the 25\% rising gate is displayed. In both cases, contours using the mMC and vMC methods match the reference contour well for the point-point SD pair. However, the contour using the mMC method fits the reference much better than that using the vMC method in a widefield-point SD setting, especially around the boundaries where the voxel model experiences increased mismatch to the high resolution model. An average RMS difference of 10.8\% and 9.4\% is achieved in the point-point SD pair Jacobians using vMC and mMC, respectively, while the RMS difference is 23.5\% and 13.4\% for the widefield-point SD pair. The values may be compared to a baseline RMS difference of 2.4\% and 3.8\%, obtained using two high resolution simulations with $10^9$ photons and different random seeds. The Jacobian accuracy comparison in time-domain is shown in Fig.~\ref{fig:accu_result}c. For the point-point SD pair, the RMS differences using the two methods do not experience significant change with respect to the changes in gate position. However, for the widefield case, the mMC has a considerably lower RMS difference when compared to the vMC for early gates.

\begin{figure}[ht]
 \centering
 \includegraphics{widefmmc-6}
 \caption{\label{fig:accu_result}The contour comparison of Jacobians using mMC and vMC for (a, c) point and (b, d) widefield illumination in (a,b) continuous wave and (c, d) time gated modes. The contours filled with solid colors are from the high resolution simulation. (e) The RMS comparison in time-domain for four cases and all time gates.}
\end{figure}


\subsection{Simulation reconstruction performance}
The reconstruction using mesh-based MC methods is shown in Fig.~\ref{fig:recon_result}. The kidneys are accurately reconstructed in regard to their original discretization. The maximum reconstructed value for the mesh-based MC is 91.0\% of the expected value. The centroids of the two kidneys have position error of 1.02mm and 0.78mm, respectively. The maximum reconstructed value for the mMC method is 3.4\% more accurate toward the expected value compared to the vMC method. The run time for creating the time-domain mesh-based Jacobian was 3.47h for $64\times64$ SD pairs (in total 4,096) on 512 nodes.

\begin{figure}[ht]
 \centering
 \includegraphics{widefmmc-7}
 \caption{\label{fig:recon_result}The simulation reconstruction result. (a) The 50\% isovolume of the reconstruction. (b) 5 slices from x = 50mm to x = 70mm with a separation of 5mm.}
 \end{figure}

\subsection{Experimental comparison}
The 50\% isovolume of the reconstructed effective quantum yield using the mMC node-based results is shown in Fig.~\ref{fig:exp_result}a. This isovolume has a maximum length of 11.3mm and mean diameter of 2.7mm. Moreover, sample slices of cross-sections overlaid on the corresponding CT slices demonstrates the accurate localization of the inclusion. The element-based (92,713 elements) and voxel-based (15,581 nodes) reconstructions are also shown for comparison in Fig.~\ref{fig:exp_result}b and ~\ref{fig:exp_result}c, respectively. The node-based and element-based weight matrices result in very close reconstructions in both quantification ($<2$\% difference in maximum reconstructed value) and position of the inclusion. The centroid position error of the 50\% isovolume using vMC is 2.83mm while that using mMC is 0.56mm.

\begin{figure}[ht]
 \centering
 \includegraphics{widefmmc-8}
 \caption{\label{fig:exp_result}(a) 3D reconstruction showing the 50\% isovolume of the reconstructed object (blue) and the fluorescent vessel (red). Axial and frontal slices at z = 6.5mm and y = 21.5mm for (b) node-based, (c) element-based and (d) voxel-based reconstructions, respectively.}
 \end{figure}

\section{Discussion}
In this study, we successfully implemented the widefield illumination for calculating the propagation of light through arbitrary 3D media in mesh-based models. After implementing an optimized ray-test step for determining the injecting point for each photon, computational time for widefiled illumination simulations becomes only marginally longer (5-10\%) than point-source simulations. This is a remarkable improvement compared to the 50 times increase when no optimization was employed in the widefield illumination.

Moreover, the parabolization of the code is a significant advancement of the computation capacity. The aim of developing a mixed mode MPI / OpenMP code was to utilize the full computation availability of single symmetric multiprocessors (SMPs) or a cluster of SMPs, and achieve a performance improvement over a pure MPI code. As expected, when the number of threads/processes is the same, the most efficient parallelization implementation is using pure multi-threading (OpenMP). Although the number of threads on a single SMP is restricted at current time (usually less than 8), larger systems of single SMP could become available in the foreseeable future with 64-bit memory addressing. The efficiency comparison of hybrid code with the pure MPI implementation reveals that a performance improvement is achieved: the overall computational time has reduced and the scaling of the code with increasing thread number is better. Hence a combination of MPI and OpenMP parallelization paradigms provides more efficient parallelization strategy. In addition, the hybrid code makes full use of the memory on each node due to the shared memory mechanism, allowing for the possibility of running larger-scale programs. However, the improvement in time is quite small since the pure MPI parallelization algorithm for forward MC method is optimized and scales very well already, with no significant impairment due to load imbalance or memory limitation. Overall, the three parallelization implementations have excellent scalability with almost linear acceleration when parallelized platform is available. Thus when more threads/processes are utilized, the faster the simulation will be.

%% accuracy in wide field and time domain
The widefield implementation enables the ability to perform simulations with spatial variation in illumination intensity, which is essential for the emerging pattern light tomography applications \cite{Chen2010,Joshi2006, dutta2010,DAndrea2010}. Compared to the conventional punctual illumination, the widefield light illumination results in more errors on the boundaries where the model mismatch exists due to the larger illumination area, in both the forward propagation and Jacobians. When the time gated information is considered, the RMS difference comparison represents a significant improvement in accuracy for mMC at early gates, yet with an acceleration of $\sim$5x compared to vMC. This is of importance as the early gates are not modeled accurately by diffusion equation while the incentive of using early gates to increase spatial resolution in FMT is rising. However, this accuracy improvement in Jacobian does not necessarily relates to the reconstruction performance (13.2\% discrepancy in the RMS differences for Jacobian at the selected time gate, 3.4\% difference in the maximum reconstructed value comparing the mMC and vMC reconstruction). This can be explained by the fact that the disparity mainly exists around the boundaries while the expected fluorescent objects are closer to the center.

%% limitations
One of the limitations of the current implementation is that it does not model the time difference when the photons from the source plane project to the boundary. However, according to the light propagation velocity in air, the maximum error in time for the typical specimen thickness (around 2cm) in preclinical applications is $\scriptsize{\sim}$3ps, which is negligible compared to the minimum time interval of 20ps for the time-resolved platform. Another potential difficulty for \emph{in vivo} assessment may resides in the availability of anatomical atlas, as high resolution imaging modalities are required to obtain accurate surfaces for mesh-based reconstructions.

%% mesh optimization availabilty
As a few examples, this code can be used for investigating the fluorophore distribution with time-gated data type while serving as the forward solver for MRI guided FMT. In this work, the mesh model for the forward and reconstruction processes was confined in homogeneous tissue and uniformly tessellated; however, it is noteworthy that it can also present in highly heterogeneous medium with arbitrary internal/external boundaries while retaining a small number of nodes and elements, thanks to its flexibility. The representation of mesh-based models for high resolution 3D anatomic images can be further optimized (eg. reducing the nodes while keeping the geometry unchanged) to benefit the accuracy of the reconstructions.

%% voxel->node mesh, memory taken
The rationale of similar results using node and element based weight matrix is that these methods render essentially the same spatial profiles of photon propagation, though the number of nodes for both cases are much less than that of tetrahedron elements (15,581 nodes compared to 92,713 elements). Furthermore, in the ill-posed inverse problem, the effective information mainly relies on the measurements but not the image-space spatial distribution when the number of measurements (dependent on the number of SD pairs and time-gates selected) is much less than that of the unknowns. Casting nodes as the unknowns in the inverse problem leads to a much smaller unsolved linear system \cite{Yalavarthy2008}, which is superior as the memory constraint is usually a burden in reconstructions, especially when the time-resolved data are employed to provide extra information. Due to this advantage in memory utilization, employing nodes instead of elements also allows for the use of denser source-detector pairs, which is crucial for improving resolution of the reconstructed objects.

%% DOT
In addition, the formulation for diffuse optical tomography (DOT) using the MC method is readily available for the adjoint method, thus the mMC method can be easily applied to DOT to quantitatively determine the functional parameters (blood volume and hemoglobin oxygenation) related to the optical properties as well \cite{Chen2009a}. As FMT/DOT is becoming frequently used in combination with high-resolution imaging methods such as MRI, CT, X-Ray and ultrasound \cite{Azar2008}, mMC can serve as an effective tool to incorporate high-resolution structural information into FMT/DOT with limited spatial resolution for quantitative functional and molecular studies.

\section{Conclusion}
This work focuses on assessing the feasibility and evaluating the computational performance of the widefield mMC algorithm for time-resolved FMT, and the improvements on the quality of models with complex boundaries. We demonstrate that the mMC method is a computationally efficient solution for tomographic use when modeling of general media is required (about 5 times faster than vMC). Simulations on a complex digital mouse phantom establish that both mMC and vMC match well with the high resolution vMC output in temporal and spatial profiles for punctual SD settings; however, for widefield illumination, the accuracy for mMC is improved compared to that for vMC. Moreover, with the recent progress of MC simulation using GPU\cite{Alerstam2008b, Fang2009}, it is expected that a full reconstruction can be finished with a typical personal computer in the time frame comparable to the classic DE based models. The current version of this code is available for download as MMC v1.0 at http://mcx.sourceforge.net/mmc/, and more features may be added, such as a mesh-based way to represent the illumination patterns.

% If in two-column mode, this environment will change to single-column format so that long equations can be displayed.
% Use only when necessary.
%\begin{widetext}
%$$\mbox{put long equation here}$$
%\end{widetext}

% Figures should be put into the text as floats.
% Use the graphics or graphicx packages (distributed with LaTeX2e).
% See the LaTeX Graphics Companion by Michel Goosens, Sebastian Rahtz, and Frank Mittelbach for examples.
%
% Here is an example of the general form of a figure:
% Fill in the caption in the braces of the \caption{} command.
% Put the label that you will use with \ref{} command in the braces of the \label{} command.
%
% \begin{figure}
% \includegraphics{}%
% \caption{\label{}}%
% \end{figure}

% Tables may be be put in the text as floats.
% Here is an example of the general form of a table:
% Fill in the caption in the braces of the \caption{} command. Put the label
% that you will use with \ref{} command in the braces of the \label{} command.
% Insert the column specifiers (l, r, c, d, etc.) in the empty braces of the
% \begin{tabular}{} command.
%
% \begin{table}
% \caption{\label{} }
% \begin{tabular}{}
% \end{tabular}
% \end{table}

% If you have acknowledgments, this puts in the proper section head.
\begin{acknowledgments}
This work was supported by National Institutes of Health (NIH) grant R21 EB013421 and by the National Science Foundation (NSF) grant CBET-1149407. Qianqian Fang wishes to thank the funding support from the Bill \& Melinda Gates Foundation (\#OPP1035992).
\end{acknowledgments}

% Create the reference section using BibTeX:
\bibliography{../../thesis/thesis}

\end{document}
%
% ****** End of file aiptemplate.tex ****** 